Claims

What is claimed is: 

1. A system to facilitate building a statistical model for a data set, comprising:
a first training algorithm operative to efficiently build a first model for each
subset of the data set;

an evaluation function operable to determine whether a subset of the data set 
for which the respective first model was built is an appropriate subset of the data set; and 

a second training algorithm operable to build a second model based on the 
appropriate subset of the data set, the second training algorithm being more accurate than the 
first training algorithm.

2. The system of claim 1, further comprising a data scheduler which, based on a
data policy, is operative to control the size of subsets for which the first training algorithm is
applied.

3. The system of claim 2, wherein the data scheduler is operable to increase the 
size of the subset to provide a larger aggregate subset of the data set if the first model is 
unacceptable, the first training algorithm being operative to efficiently build the first model 
for each larger aggregate subset of the data until the evaluation function determines the 
resulting first model to be acceptable. 

4. The system of claim 3, wherein the acceptability of each first model is 
determined based on a stopping criterion functionally related to an expected incremental 
benefit and a cost associated with increasing the size of the aggregate subset of the data set. 

5. The system of claim 4, wherein the cost of the stopping criterion is 
functionally related to at least one of time associated with evaluating an aggregate data subset 
of increased size and size of the aggregated subset of the data. 
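
By way of illustration only, a stopping test of the kind recited in claims 4 and 5 might be sketched as follows. The function name `should_stop`, its parameters, and the particular ratio used (relative holdout log-likelihood gain divided by the size increment) are assumptions made for this sketch; they are not the claimed expression.

```python
def should_stop(ll_current, ll_previous, ll_base, size_increment, threshold):
    """Return True when the estimated benefit per unit cost falls below the threshold.

    ll_current, ll_previous: holdout log likelihoods of the models built on the
        current and previous subsets (hypothetical inputs).
    ll_base: holdout log likelihood of a base model.
    size_increment: number of records added to reach the current subset,
        used here as a stand-in for cost.
    """
    gain_over_base = ll_current - ll_base
    if gain_over_base <= 0:
        # Assumption: keep adding data while the model does not beat the base model.
        return False
    benefit = (ll_current - ll_previous) / gain_over_base
    cost = float(size_increment)
    return benefit / cost < threshold
```

With a threshold of 0.001, a gain of 10 log-likelihood units over a model that already leads the base model by 1000 units, bought with 100 extra records, yields a ratio of 0.0001 and stops the growth of the subset.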
6. The system of claim 4, wherein the stopping criterion is defined by

$\ell(D_{HO} \mid \theta(D_n)) - \ell(D_{HO} \mid \theta(D_{n-1}))$ [remainder of the expression is not legible in the source]

where

$\ell(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model built
by the first training algorithm on a current subset of the training data set,

$\ell(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$\ell(D_{HO} \mid \theta_{base}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
training algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second training algorithm, when applied to
the first subset,

$\bar{J} = \frac{1}{n} \sum_{i=1}^{n} J_i$,

$J_i$ is the number of iterations for the first training algorithm when applied to a
data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

7. The system of claim 4, wherein the stopping criterion is defined by

$\ell(D_{HO} \mid \theta(D_n)) - \ell(D_{HO} \mid \theta(D_{n-1}))$ [remainder of the expression is not legible in the source]

where

$\ell(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model built
by the first training algorithm on a current subset of the training data set,

$\ell(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$\ell(D_{HO} \mid \theta_{base}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$\delta$ is an offset associated with a difference in log likelihood for holdout data
when evaluated for models built on a first subset of the training data set by the
respective first and second training algorithms,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
training algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second training algorithm, when applied to
the first subset,

$\bar{J} = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is the number of iterations for the first training algorithm when applied to a
data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.



8. The system of claim 1, wherein the first training algorithm further comprises 
an iterative algorithm, which is operative to build the model for the subset of the data set 
according to an associated training policy.

9. The system of claim 8, wherein the first training algorithm further comprises 
an associated training policy that defines parameter initialization of the first training 
algorithm for each subset of the data set. 

10. The system of claim 9, wherein the training policy associated with the first
training algorithm further controls parameter initialization of the first training algorithm, such
that at least some of the parameters computed for a previous subset of the data are employed 
to initialize the first training algorithm for a subsequent larger aggregate subset of the data. 
11. The system of claim 9, wherein the first training algorithm is initialized by the
same parameter values for each subset of the data subset. 

12. The system of claim 9, wherein the training policy sets the iterative algorithm 
to perform a fixed number of at least one iteration. 

13. The system of claim 12, wherein the training policy sets the iterative 
algorithm to perform a single iteration. 

14. The system of claim 12, wherein the second training algorithm further 
comprises an iterative algorithm that operates according to an associated training policy, so
as to produce a more accurate model for the appropriate subset of the data set than the first 
training algorithm. 

15. The system of claim 14, wherein the iterative algorithm associated with at 
least one of the first and second training algorithms is an Expectation and Maximization 
algorithm. 

16. The system of claim 8, wherein the training policy associated with the 
iterative algorithm of the first training algorithm controls the iterative algorithm to run until 
an associated convergence criterion is satisfied. 

17. The system of claim 16, wherein the second training algorithm further comprises
an iterative algorithm, which is operative to build the model for the appropriate subset of the
data set according to an associated training policy.

18. The system of claim 17, wherein the training policy associated with the
iterative algorithm of the second training algorithm controls the respective iterative algorithm 
to run until an associated convergence criterion is satisfied, wherein the convergence 
criterion associated with the second training algorithm provides improved model quality 
relative to the convergence criterion associated with the first training algorithm. 
19. A system programmed to facilitate building a statistical model, comprising:

a first parameter estimation algorithm operable to efficiently build models for 
subsets of a data set based on a training policy associated therewith; and 

an evaluation function operable to determine whether a subset of data for 
which the model was built is an appropriate size for building the statistical model to 
characterize the data set; 

a second parameter estimation algorithm operable to build a model on a subset 
of the data set having the appropriate size, the second parameter estimation algorithm having 
an associated training policy, which enables the second parameter estimation algorithm to 
build a more accurate model than the first parameter estimation algorithm. 

20. The system of claim 19, further comprising a data scheduler operable to 
increase the size of the subset of the data set to provide a larger aggregate subset of the data 
set if the model is unacceptable, the first parameter estimation algorithm being operative to 
efficiently build a model for each larger aggregate subset until a resulting model built 
therefrom is determined to be acceptable.

21. The system of claim 19, wherein the first parameter estimation algorithm 
further comprises an iterative algorithm operative to build the model for each subset of the
data set according to the associated training policy. 

22. The system of claim 21, wherein the training policy for the first parameter
estimation algorithm is operative to control parameter initialization for the first parameter 
estimation algorithm, such that at least some of the parameters computed for a previous 
subset of the data are employed to initialize the first parameter estimation algorithm for a
subsequent larger aggregate subset of the data set. 
23. The system of claim 21, wherein the first parameter estimation algorithm is
initialized by the same parameter values for each subset of the data subset.

24. The system of claim 21, wherein the training policy associated with the first
parameter estimation algorithm controls the iterative algorithm of the first parameter
estimation algorithm to perform a fixed number of at least one iteration, the second training
algorithm further comprising an iterative algorithm, which is operative to perform a greater
number of iterations than the iterative algorithm of the first training algorithm based on a
training policy associated with the second parameter estimation algorithm.



25. The system of claim 21, wherein the training policy associated with the 
iterative algorithm of the first parameter estimation algorithm controls the iterative algorithm 
to run until an associated convergence threshold is satisfied, wherein the second training 
algorithm further comprises an iterative algorithm, the training policy associated with the
iterative algorithm of the second parameter estimation algorithm being operative to control 
the respective iterative algorithm to run until an associated convergence threshold is satisfied, 
the convergence threshold associated with the second parameter estimation algorithm is less 
than the convergence threshold associated with the first parameter estimation algorithm.

26. The system of claim 19, wherein the evaluation function determines whether
the subset of data for which the model was built is an appropriate size based on a stopping 
criterion, which is functionally related to an expected incremental benefit and an expected 
incremental cost associated with increasing size of the subset of data. 



27. The system of claim 26, wherein the cost of the stopping criterion is 
functionally related to at least one of time associated with evaluating the model for a larger
subset of data and size of the larger subset of the data. 



28. The system of claim 26, wherein the stopping criterion is defined by

$\ell(D_{HO} \mid \theta(D_n)) - \ell(D_{HO} \mid \theta(D_{n-1}))$ [remainder of the expression is not legible in the source]

where

$\ell(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model built
by the first training algorithm on a current subset of the training data set,

$\ell(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$\ell(D_{HO} \mid \theta_{base}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to the first subset,

$\bar{J} = \frac{1}{n} \sum_{i=1}^{n} J_i$,

$J_i$ is the number of iterations for the first parameter estimation algorithm when
applied to a data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

29. The system of claim 26, wherein the stopping criterion is defined by

$\ell(D_{HO} \mid \theta(D_n)) - \ell(D_{HO} \mid \theta(D_{n-1}))$ [remainder of the expression is not legible in the source]

where

$\ell(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model built
by the first training algorithm on a current subset of the training data set,

$\ell(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$\ell(D_{HO} \mid \theta_{base}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,






MS158346.1 



$\delta$ is an offset associated with a difference in log likelihood for holdout data
when evaluated for models built on a first subset of the training data set by the
respective first and second training algorithms,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first data subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to a first data subset,

$\bar{J} = \frac{1}{n} \sum_{i=1}^{n} J_i$,

$J_i$ is the number of iterations for the first parameter estimation algorithm when
applied to a data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

30. A learning curve method to facilitate building a statistical model, comprising:
choosing a subset of a data set;

employing a first training algorithm to build a first model to characterize the
subset;

evaluating the first model; 

if the first model is unacceptable, repeatedly increasing the size of the subset 
of data to provide an aggregate data set, building another first model to characterize the 
aggregate subset, and reevaluating the model; and 

if the model is acceptable, employing a second training algorithm to build a 
second model based on the aggregate data set, the second training algorithm being different 
from the first training algorithm.
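
The steps recited in claim 30 can be illustrated with a minimal sketch. Everything specific here is an assumption chosen to keep the example short: the model is a unit-variance Gaussian with an unknown mean, the hypothetical `train` function stands in for both training algorithms (few iterations for the first, many for the second), and acceptability is judged by a relative holdout log-likelihood gain falling below a hypothetical `threshold`.

```python
import math
import random


def log_likelihood(data, mean):
    """Log likelihood of data under a unit-variance Gaussian with the given mean."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mean) ** 2 for x in data)


def train(data, start, iterations, step=0.5):
    """Iteratively move the mean estimate toward the sample mean of the data."""
    mean = start
    sample_mean = sum(data) / len(data)
    for _ in range(iterations):
        mean += step * (sample_mean - mean)
    return mean


def learning_curve(training, holdout, first_size=50, growth=2, threshold=0.01):
    """Grow the training subset with a cheap trainer, then refit with an accurate one."""
    size = min(first_size, len(training))
    mean = train(training[:size], 0.0, iterations=2)        # cheap first model
    previous = log_likelihood(holdout, mean)
    while size < len(training):
        size = min(size * growth, len(training))            # larger aggregate subset
        mean = train(training[:size], mean, iterations=2)   # cheap, warm-started
        current = log_likelihood(holdout, mean)
        improvement = (current - previous) / abs(previous)
        previous = current
        if improvement < threshold:                         # assumed acceptance test
            break
    final = train(training[:size], mean, iterations=50)     # accurate second model
    return size, final


random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(5000)]
holdout, training = data[:1000], data[1000:]
chosen_size, final_mean = learning_curve(training, holdout)
```

In this sketch the cheap passes reuse the previous estimate as their starting point, which is one of several initialization choices; restarting from a fixed value on every subset would be an equally valid variant.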

31. The method of claim 30, further comprising determining the acceptability of
each first model based on a stopping criterion functionally related to an expected incremental
benefit and an expected incremental cost associated with increasing the size of the aggregate 
subset of the data set. 
32. The system of claim 31, wherein the cost of the stopping criterion is
functionally related to at least one of time associated with evaluating an aggregate data subset 
of increased size and size of the aggregate subset of the data. 
33. The system of claim 31, wherein the stopping criterion is defined by

$\ell(D_{HO} \mid \theta(D_n)) - \ell(D_{HO} \mid \theta(D_{n-1}))$ [remainder of the expression is not legible in the source]

where

$\ell(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model built
by the first training algorithm on a current subset of the training data set,

$\ell(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$\ell(D_{HO} \mid \theta_{base}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to the first subset,

$\bar{J} = \frac{1}{n} \sum_{i=1}^{n} J_i$,

$J_i$ is a number of iterations for the first parameter estimation algorithm when
applied to a data subset $D_i$,

$|D_{n+1}|$ is a size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is an increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

34. The system of claim 31, wherein the stopping criterion is defined by

$\ell(D_{HO} \mid \theta(D_n)) - \ell(D_{HO} \mid \theta(D_{n-1}))$ [remainder of the expression is not legible in the source]

where

$\ell(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model built
by the first training algorithm on a current subset of the training data set,

$\ell(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$\ell(D_{HO} \mid \theta_{base}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$\delta$ is an offset associated with the difference in log likelihood for holdout data
when evaluated for models built on a first subset of the training data set by the
respective first and second training algorithms,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first data subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to a first data subset,

$\bar{J} = \frac{1}{n} \sum_{i=1}^{n} J_i$,

$J_i$ is a number of iterations for the first parameter estimation algorithm when
applied to a data subset $D_i$,

$|D_{n+1}|$ is a size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is an increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

35. The method of claim 30, wherein the first training algorithm is more 
computationally efficient than the second training algorithm. 

36. The method of claim 30, wherein each instance of model building repeated 
until obtaining an acceptable model by the first training algorithm employs more efficient 
and less accurate model building than model building employed by the second training 
algorithm that occurs after obtaining the acceptable model.

37. The method of claim 36, wherein each instance of model building repeated 
until obtaining an acceptable model employs the first training algorithm as an iterative 
algorithm that is run to a first convergence criterion, the second training algorithm employing 
an iterative algorithm that is run to a second convergence criterion, which demands more 
iterations than the first convergence criterion in order to obtain convergence, so that the
second model is more accurate than the first model built by the first training algorithm. 

38. The method of claim 36, wherein each instance of model building repeated 
until obtaining an acceptable model employs an iterative algorithm having a fixed number of 
at least one iteration, the second training algorithm employing an iterative algorithm having a 
greater number of iterations than the fixed number. 

39. The method of claim 30, further comprising controlling parameter 
initialization employed in each instance of building a model for the aggregate data set prior
to obtaining an acceptable model.

40. The method of claim 39, further comprising initializing the first training 
algorithm by the same parameter values for each subset. 

41. The method of claim 39, wherein the controlling further comprises reusing at
least some of the parameters computed from a previous instance of model building to
initialize a subsequent instance of model building for a subsequent larger aggregate data set 
prior to obtaining an acceptable model. 

42. A computer-readable medium having computer-executable instructions for: 
choosing a subset of a data set; 

building a model to characterize the subset based on an associated training
policy;

evaluating the model; 

if the model is unacceptable, repeatedly increasing the size of the subset of 
data to provide an aggregate data set, building a model to characterize the aggregate subset 
based on an associated training policy, and reevaluating the model; and 

if the model is acceptable, employing the aggregate data set to build a 
corresponding model based on an associated training policy, the training policy associated
with the model building repeated until obtaining an acceptable model being more
computationally efficient than the training policy associated with model building subsequent
thereto. 

43. The method of claim 42, further comprising determining the acceptability of
the model based on an expected incremental benefit relative to an expected incremental cost 
associated with increasing the size of the aggregate data set. 

44. A method to facilitate constructing a statistical model, comprising:
separating data into holdout data and training data; 

determining a data subset from the training data by estimating model 
parameters according to a first training policy and evaluating the estimated model parameters 
relative to the holdout data set and repeating the estimation and evaluation of model 
parameters with a larger subset of the training data until an acceptable quality of the
estimated model is established; and

subsequent to establishing the acceptable quality of the estimated model, 
using the determined data subset to improve the estimated model parameters by employing a 
second training policy that is more accurate than the first training policy. 

45. The method of claim 44, wherein each estimation of model parameters 
repeated until the acceptable quality of the estimated model is established further comprises 
employing an iterative algorithm that is run until a first convergence criterion is satisfied, the 
estimation of model parameters using the determined data subset further comprising an 
iterative algorithm that is run until a second convergence criterion is satisfied, which is 
operative to provide a better quality of model than the first convergence criterion. 

46. The system of claim 45, wherein the first convergence criterion causes the 
associated iterative algorithm to run until a first convergence threshold is satisfied, wherein 
the second convergence criterion causes the associated iterative algorithm to run until a 
second convergence threshold is satisfied, the second convergence threshold being less than 
the first convergence threshold. 
47. The method of claim 45, wherein at least one of the iterative algorithm run to
the first convergence criterion and the iterative algorithm run to the second convergence 
criterion is an Expectation and Maximization algorithm. 

48. The method of claim 44, wherein each estimation of model parameters 
repeated until the acceptable quality of the estimated model is established employs an 
iterative algorithm having a fixed number of at least one iteration, the estimation of model 
parameters using the determined data subset further employing an iterative algorithm having 
a greater number of iterations than the fixed number. 

49. The method of claim 44, further comprising controlling parameter 
initialization employed in each estimation of model parameters repeated until determining an 
acceptable size for the determined data subset. 

50. The method of claim 44, wherein the controlling further comprises reusing at 
least some of the parameters computed from a previous estimation of model parameters to 
initialize a subsequent estimation of model parameters for a next larger subset of the training
set. 

51. The method of claim 44, wherein each estimation of model parameters
repeated until the acceptable quality of the estimated model is established further comprises
initializing the first training algorithm by the same parameter values.

52. The method of claim 44, further comprising determining the acceptability of
the estimated model based on an expected incremental benefit relative to a cost associated 
with increasing the size of the subset of the data set. 

53. A computer-readable medium having computer-executable instructions for: 
separating data into holdout data and training data; 

determining a data subset from the training data by estimating model 
parameters according to a first training policy and evaluating the estimated model parameters 
relative to the holdout data set and repeating the estimation and evaluation of model
parameters with a next successively larger subset of the training data set until an acceptable 
quality of the estimated model is established; and 

subsequent to establishing the acceptable quality of the estimated model, 
using the determined data subset to improve the estimated model parameters by employing a 
second training policy that is more accurate than the first training policy. 

54. A method to facilitate constructing a statistical model, comprising: 
separating data into a holdout data set and a training data set; 
iteratively estimating model parameters for a subset of the training data set
over a fixed number of iterations and evaluating the estimated model parameters relative to
the holdout data set; 

repeating the estimation and evaluation of model parameters obtained with 
successively larger subsets of the training data set until an acceptable model quality is
established; and 

after the acceptable model quality is established, iteratively estimating model
parameters for the data subset, which provided the acceptable model quality, until a better 
quality of model is provided relative to a preceding estimation performed over the fixed 
number of iterations. 

55. The method of claim 54, wherein at least one of the iterative estimations 
employs an Expectation and Maximization algorithm. 

56. The method of claim 54, wherein the estimation that occurs after the 
acceptable model quality is established, further comprises employing an iterative algorithm
having a greater number of iterations than the fixed number. 

57. The method of claim 54, wherein the estimation of model parameters after the 
acceptable model quality has been established further comprises employing an iterative
algorithm that is run until a convergence criterion is satisfied, which is operative to provide a
better quality of model with the data subset than a preceding estimation employing the fixed
number of iterations. 

58. The method of claim 54, further comprising controlling parameter 
initialization for each estimation of model parameters that occurs before the acceptable
model quality has been established. 

59. The method of claim 58, wherein each iterative estimation until the acceptable 
model quality is established further comprises initializing the first training algorithm by the
same parameter values. 

60. The method of claim 58, wherein the controlling further comprises reusing at 
least some of the parameters obtained in a previous estimation of model parameters to 
initialize a subsequent estimation of model parameters for a next larger subset of the training
data set. 

61. The method of claim 54, further comprising determining the acceptability of
the model based on an expected incremental benefit relative to an expected incremental cost 
associated with an increase in size of each larger training subset of the data set. 

62. A method to facilitate constructing a statistical model, comprising: 
separating data into a holdout data set and a training data set; 
iteratively estimating model parameters for a subset of the training data set
until a first convergence threshold is satisfied and evaluating the estimated model parameters
relative to the holdout data set; 

repeating the estimation and evaluation of model parameters obtained with 
successively larger subsets of the training data set until determining a size of data subset that 
provides acceptable model parameters; and 

after determining the size of data subset that provides acceptable model 
parameters, iteratively estimating model parameters for a data subset of the acceptable size 
until a second convergence threshold is satisfied, the second convergence threshold being
less than the first convergence threshold. 
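
The relationship between the two convergence thresholds of claim 62 can be illustrated with a small Expectation and Maximization sketch. The function `em_two_means`, the equal-weight unit-variance two-component mixture, the starting means, and the threshold values 1e-2 and 1e-8 are all assumptions made for illustration; the claim does not prescribe any of them.

```python
import math
import random


def em_two_means(data, means, threshold, max_iterations=1000):
    """EM for an equal-weight, unit-variance, two-component 1-D Gaussian mixture.

    Only the two means are estimated. Iteration stops when the gain in average
    log likelihood drops below `threshold`. Returns the means and the number of
    parameter updates performed.
    """
    m0, m1 = means
    previous = -math.inf
    for iteration in range(1, max_iterations + 1):
        # E-step: responsibility of the second component for each point.
        # The constant 1/sqrt(2*pi) is dropped; it cancels in the differences.
        resp, log_lik = [], 0.0
        for x in data:
            p0 = 0.5 * math.exp(-0.5 * (x - m0) ** 2)
            p1 = 0.5 * math.exp(-0.5 * (x - m1) ** 2)
            resp.append(p1 / (p0 + p1))
            log_lik += math.log(p0 + p1)
        average = log_lik / len(data)
        if average - previous < threshold:
            return (m0, m1), iteration - 1
        previous = average
        # M-step: responsibility-weighted means.
        total1 = sum(resp)
        total0 = len(data) - total1
        m1 = sum(r * x for r, x in zip(resp, data)) / total1
        m0 = sum((1 - r) * x for r, x in zip(resp, data)) / total0
    return (m0, m1), max_iterations


random.seed(1)
data = ([random.gauss(-2.0, 1.0) for _ in range(500)]
        + [random.gauss(2.0, 1.0) for _ in range(500)])
# First stage: loose threshold, as would be used while the subset is still growing.
coarse_means, coarse_updates = em_two_means(data, (-0.5, 0.5), threshold=1e-2)
# Second stage: tighter (smaller) threshold on the chosen subset, warm-started.
refined_means, refined_updates = em_two_means(data, coarse_means, threshold=1e-8)
```

Because both runs follow the same deterministic sequence of updates from a given starting point, the run with the smaller threshold never stops earlier than the run with the larger one.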

63. A system to facilitate building a statistical model for a data set, comprising: 
first means for building a first model to characterize a subset of the data set; 
evaluation means for evaluating the acceptability of the model, the first means
building another first model for a larger subset of the data if the evaluation means determines
that a prior first model is unacceptable; and 

second means, which is different from the first means, for building a second 
model to characterize an aggregate subset of data that enabled the first means to produce an 
acceptable model. 

64. A system to facilitate building a statistical model for a data set, comprising: 
first means for estimating model parameters from a subset of the data set;
means for evaluating the estimated model parameters relative to a holdout set
of the data set;

means for determining a data subset from the training data by causing the first
means and the means for evaluating to respectively repeat estimation and evaluation of 
model parameters with a next successively larger subset of the training data set until an 
acceptable quality of the model parameters is established; and 

second means for estimating model parameters based on the determined data 
subset to provide a more accurate estimation of model parameters than the first means. 






